[Server::Shutdown] Fix the Matter server shutdown sequence; introduce a few APIs to do *ClusterServerShutdown so that the Matter server can be reinitialised without a reboot on embedded platforms. #38515
Conversation
PR #38515: Size comparison from 37948e2 to 6df1418. Increases above 0.2%:
Full report (24 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, qpg, stm32, telink)
PR #38515: Size comparison from 37948e2 to 720ea1c. Increases above 0.2%:
Full report (55 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, cyw30739, esp32, linux, nrfconnect, nxp, psoc6, qpg, stm32, telink, tizen)
PR #38515: Size comparison from 36c5265 to 1bcb1ea. Increases above 0.2%:
Full report (75 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, cyw30739, efr32, esp32, linux, nrfconnect, nxp, psoc6, qpg, stm32, telink, tizen)
PR #38515: Size comparison from 4e8a75c to 291897f. Increases above 0.2%:
Full report (75 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, cyw30739, efr32, esp32, linux, nrfconnect, nxp, psoc6, qpg, stm32, telink, tizen)
PR #38515: Size comparison from 7460361 to 22b18df. Increases above 0.2%:
Full report (75 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, cyw30739, efr32, esp32, linux, nrfconnect, nxp, psoc6, qpg, stm32, telink, tizen)
PR #38515: Size comparison from a7cf007 to 82ef723. Increases above 0.2%:
Full report (3 builds for cc32xx, stm32)
PR #38515: Size comparison from d761e1b to a1c6f42. Increases above 0.2%:
Full report (41 builds for cc32xx, cyw30739, efr32, linux, nrfconnect, psoc6, qpg, stm32, telink, tizen)
PR #38515: Size comparison from d761e1b to 22eaff4. Increases above 0.2%:
Full report (41 builds for cc32xx, cyw30739, efr32, linux, nrfconnect, psoc6, qpg, stm32, telink, tizen)
PR #38515: Size comparison from 68fb9b5 to 4153b41. Increases above 0.2%:
Full report (27 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, efr32, qpg, stm32, telink)
PR #38515: Size comparison from 68fb9b5 to 1d4cf19. Increases above 0.2%:
Full report (75 builds for bl602, bl702, bl702l, cc13x4_26x4, cc32xx, cyw30739, efr32, esp32, linux, nrfconnect, nxp, psoc6, qpg, stm32, telink, tizen)
/gemini review
Code Review
This pull request addresses a critical gap in the Matter server shutdown process by introducing a more structured approach to shutting down the data model and associated components. The changes correctly integrate cluster-specific shutdown callbacks and ensure active interactions are cleaned up before core data model structures are dismantled. This is a significant improvement for the stability and correctness of the server shutdown flow.
Summary of Findings
- Incomplete Scenes Cluster Shutdown: The existing shutdown logic in `scenes-server.cpp` and the `Shutdown()` call in the `ScenesServer` destructor (`scenes-server.h`) have been commented out with TODOs. This leaves the Scenes cluster potentially not cleaning up its persistent storage or other resources correctly during shutdown, which is a high-severity issue for a PR focused on fixing shutdown flows.
- Complex Resource Management Logic: Significant logic has been added to `InteractionModelEngine.cpp` for managing resource pools (read/subscribe handlers, paths) based on fabric quotas, including eviction logic. While necessary for resource-constrained devices, this is complex and requires thorough testing to ensure correctness and handle edge cases during resource allocation and eviction.
- Placeholder Shutdown Callbacks: Empty `Matter*PluginServerShutdownCallback` functions were added for various clusters in `util.cpp` and `camera-av-settings-user-level-management-server.cpp`. These are placeholders and need to be implemented by the respective cluster owners if specific shutdown logic is required for those clusters; a sketch of the pattern follows this list.
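For context, here is a minimal sketch of what filling in one of these placeholder callbacks might look like; the cluster name `Foo` and the cleanup steps are illustrative assumptions, not the PR's actual code:

```cpp
// Hypothetical instantiation of the Matter*PluginServerShutdownCallback pattern;
// "Foo" and the steps below are placeholders, not real SDK code.
void MatterFooPluginServerShutdownCallback()
{
    // Mirror whatever MatterFooPluginServerInitCallback() registered:
    //  - unregister the cluster's AttributeAccessInterface (AAI)
    //  - unregister its CommandHandlerInterface (CHI)
    //  - remove any fabric delegate it installed
    //  - delete/null the singleton instance so a later re-init succeeds
}
```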
Merge Readiness
The pull request introduces crucial improvements to the server shutdown process. However, the commented-out shutdown logic in the Scenes cluster represents a high-severity issue that needs to be addressed before merging. Once the Scenes cluster shutdown is properly implemented or verified, and the TODOs are resolved, the PR should be ready to merge. I am unable to approve this pull request directly; please have other reviewers approve this code before merging.
// TODO: Remove the commented code.
// MatterScenesManagementClusterServerShutdownCallback was not being called before and in the sceneTable->RemoveEndpoint();
// it removes the SceneTableEntry from the persistent storage.
// uint16_t endpointTableSize = 0;
// VerifyOrReturn(Status::Success == Attributes::SceneTableSize::Get(endpoint, &endpointTableSize));
//
// // Get Scene Table Instance
// SceneTable * sceneTable = scenes::GetSceneTableImpl(endpoint, endpointTableSize);
// sceneTable->RemoveEndpoint();
The existing shutdown logic for the Scenes cluster has been commented out, with a TODO indicating it needs to be removed. However, this logic (`sceneTable->RemoveEndpoint()`) seems crucial for cleaning up the SceneTableEntry from persistent storage during shutdown. Leaving this commented out means the Scenes cluster might not be cleaning up its resources correctly, which could lead to issues on subsequent server startups or resource leaks. Could you clarify if this cleanup is being handled elsewhere in the new shutdown flow, or if the commented-out code needs to be re-implemented or replaced?
TL;DR: why are we checking in commented-out code?
// TODO: Remove the commented code.
// ScenesServer::Shutdown is being called from MatterScenesManagementPluginServerShutdownCallback.
// Shutdown();
Similar to the .cpp file, the call to `Shutdown()` in the `ScenesServer` destructor has been commented out with a TODO. If `ScenesServer::Shutdown()` contains necessary cleanup logic beyond what's handled by the new `MatterScenesManagementPluginServerShutdownCallback`, removing this call could lead to resource leaks or incorrect state on shutdown. Is the logic previously in `ScenesServer::Shutdown()` now handled by the new callback or elsewhere?
Scenes are expected to be persisted across device power cycles, so there might have been a misunderstanding of the Shutdown behavior on my part. I initially wrote the scenes-server shutdown method with dynamic endpoints in mind, meaning Shutdown would only be called when an endpoint was removed. If shutdown is indeed meant to happen between device power cycles, then the scenes-server shutdown logic does need to be moved to a different handler.
@lpbeliveau-silabs Thank you for the confirmation. Is it okay if I remove that code in this PR? Will you be following up on this separately?
@@ -340,6 +340,39 @@ class InteractionModelEngine : public Messaging::UnsolicitedMessageHandler,
#endif // CHIP_CONFIG_SUBSCRIPTION_TIMEOUT_RESUMPTION
#endif // CHIP_CONFIG_PERSIST_SUBSCRIPTIONS

/** Previously, ShutdowActiveReads() was only available in unit test builds via `#if CONFIG_BUILD_FOR_HOST_UNIT_TEST`.
What is ShutdowActiveReads?
@@ -340,6 +340,39 @@ class InteractionModelEngine : public Messaging::UnsolicitedMessageHandler,
#endif // CHIP_CONFIG_SUBSCRIPTION_TIMEOUT_RESUMPTION
#endif // CHIP_CONFIG_PERSIST_SUBSCRIPTIONS

/** Previously, ShutdowActiveReads() was only available in unit test builds via `#if CONFIG_BUILD_FOR_HOST_UNIT_TEST`.
 * However, we now require proper cleanup of active read handlers during the application's shutdown sequence,
I don't understand why we are making this API public. We already shut down the read handlers in `InteractionModelEngine::Shutdown`. Why do we need a separate way of doing it? That sounds like something is just broken in our shutdown/startup sequencing.
As far as ReadClients: actual reads get shut down by shutting down sessions/exchanges, and subscriptions are shut down in `InteractionModelEngine::Shutdown` already, no? So again, this feels like a problem in our sequencing. But even if not, `ShutdownAllSubscriptions` is public API on the IM engine.
// Get Scene Table Instance
SceneTable * sceneTable = scenes::GetSceneTableImpl(endpoint, endpointTableSize);
sceneTable->RemoveEndpoint();
// TODO: Remove the commented code.
I know this code leads to incorrect behavior on shutdown, but not having this code leads to incorrect behavior on endpoint removal (storage leaks). So just commenting it out is breaking critical things, and chances are we need to fix that before we can land this PR.
Have you talked to the authors of this code about this situation, like we discussed?
Yes, I have talked with the owner @lpbeliveau-silabs. He left this comment: #38515 (comment)
OK, but who is doing that "moved to a different handler" bit and when? Because we can't ship things in the state that we will have after this PR....
I'll try to catch up with the changes and address this shortly. So to confirm: this specific behavior is for embedded systems that shut down endpoints during power cycling, but don't necessarily remove them? Just want to make sure I got this right before implementing it. I will open a PR when I figure out what this "different handler" might be.
Looking into the SDK, it seems like the previous intended behavior of Shutdown WAS to handle endpoints being removed. This PR is modifying that behavior, and I do not think this is a "Scenes is holding back the PR" situation. The distinction between endpoint Shutdown vs. Removal should either be added to the handler, or we need some code responsible for cleanup on endpoint removal implemented alongside this change.
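To make that distinction concrete, here is a hedged sketch of the split being described; the enum and handler names are hypothetical, not existing SDK API:

```cpp
#include <lib/core/DataModelTypes.h> // chip::EndpointId

// Hypothetical teardown reasons; not existing SDK API.
enum class EndpointTeardownReason
{
    kServerShutdown,  // server going down (e.g. power cycle): keep persisted scenes
    kEndpointRemoved, // dynamic endpoint removed for good: purge its storage
};

// Hypothetical handler showing where each piece of cleanup belongs.
void OnScenesEndpointTeardown(chip::EndpointId endpoint, EndpointTeardownReason reason)
{
    // In both cases: release in-RAM state (delegates, handler registrations).

    if (reason == EndpointTeardownReason::kEndpointRemoved)
    {
        // Only on true removal: erase the endpoint's SceneTableEntry data from
        // persistent storage, which is what sceneTable->RemoveEndpoint() did.
    }
}
```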
 * This is required before resetting the DataModelProvider, as ongoing reads may access it
 * and lead to crashes or undefined behavior.
 **/
app::InteractionModelEngine::GetInstance()->ShutdownActiveReads();
Again: during startup we first set the DM provider (line 320 in this diff), then do InteractionModelEngine::Init (line 393), right?
So during shutdown we should be doing `InteractionModelEngine::Shutdown` before clearing the DM provider, I would think. That's what #38515 was about. But for some reason, instead of moving `SetDataModelProvider(nullptr)` to the right place in the shutdown sequence, we ended up adding this extra shutdown step....
It's not just reads that touch the DM provider, anyway (commands and writes do too), so you really have to `InteractionModelEngine::Shutdown` before clearing the provider.
If we did that, `InteractionModelEngine::Shutdown` would clear any active read handlers, as it already does, and clear subscription clients. Which would leave read clients.... Those would get shut down when we expire their sessions, but if we want to expire them more proactively, we could do that in the IM engine shutdown. Either way, read clients don't touch the DM provider, right?
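A minimal sketch of the ordering being argued for here, assuming the existing `InteractionModelEngine` public API; the surrounding `Server::Shutdown()` body is elided and this helper name is illustrative:

```cpp
#include <app/InteractionModelEngine.h>

// Sketch only: the real Server::Shutdown() does much more. This shows just the
// ordering suggested above, which mirrors startup in reverse.
void ShutdownDataModelSketch()
{
    auto * imEngine = chip::app::InteractionModelEngine::GetInstance();

    // 1. Tear down the IM engine while the DataModelProvider is still valid,
    //    since read/write/command handlers may still dereference the provider.
    imEngine->Shutdown();

    // 2. Only after nothing in the IM layer can touch it, clear the provider
    //    (the mirror of startup, where the provider is set before IM engine Init).
    imEngine->SetDataModelProvider(nullptr);
}
```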
It seems only Scenes was relying on this, but I think other clusters should clean up their storage on endpoint departure (Groups, for instance). Do we have a replacement mechanism for dynamic endpoint shutdown, as in the endpoint going away, vs. the shutdown we are implementing in this PR?
Problem
- `VerifyOrDie` checks fire in the `emberAfClusterInitCallbacks` for almost every cluster, so re-initialising the server after a shutdown crashes the device.
- `SetDataModelProvider(nullptr)` and the cluster shutdown callbacks were not being called at all.
- The API to close active read handlers (`ShutdownActiveReads()`) was private, gated for unit testing, but closing active reads needs to happen before we call `SetDataModelProvider(nullptr)` in `Server::Shutdown()`.
Changes
- Introduced `ShutdownDataModelForTesting`, which consumes `Matter*PluginServerShutdownCallback` (to unregister the AAI and CHI from the cluster itself and also remove the registered fabric delegate) and `emberAf*ClusterShutdownCallback` (to delete the cluster instance/delegate pointers so that the cluster can be reinitialised on the next server initialisation sequence and the device will not crash on `VerifyOrDie`); see the sketch after this list.
- Call `SetDataModelProvider(nullptr)` in `Server::Shutdown()` so that the actual cluster shutdown APIs are called.
- Exposed `ShutdownActiveReads()` publicly and integrated it into the standard shutdown sequence.
- Removed `RemoveEndpoint()` from the Scenes cluster shutdown callback so that scenes persist across device power cycles.
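As an illustration of the first bullet, a hedged sketch of the `emberAf*ClusterShutdownCallback` pattern for a hypothetical Foo cluster; the type and names are assumptions, not the PR's code:

```cpp
#include <lib/core/DataModelTypes.h>
#include <lib/support/CodeUtils.h> // VerifyOrDie

// Hypothetical cluster server; stands in for a real cluster's singleton instance.
struct FooClusterServer
{
    explicit FooClusterServer(chip::EndpointId endpoint) : mEndpoint(endpoint) {}
    chip::EndpointId mEndpoint;
};

namespace {
FooClusterServer * gFooInstance = nullptr;
} // namespace

void emberAfFooClusterServerInitCallback(chip::EndpointId endpoint)
{
    // Without a matching shutdown callback, a second Server init after
    // Server::Shutdown() trips this check and the device dies here.
    VerifyOrDie(gFooInstance == nullptr);
    gFooInstance = new FooClusterServer(endpoint);
}

void emberAfFooClusterServerShutdownCallback(chip::EndpointId endpoint)
{
    (void) endpoint;
    // Delete the instance/delegate pointers so the next init can recreate them
    // without a reboot.
    delete gFooInstance;
    gFooInstance = nullptr;
}
```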
Testing
- Exercised the `Server::Shutdown` API by running multiple Shutdown and Init cycles in an ESP32 application, verifying that the device does not crash during re-initialisation for any reason.
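A sketch of the kind of stress test described; `AppInitMatterServer()` and `AppShutdownMatterServer()` are assumed application-level wrappers around `Server::Init()`/`Server::Shutdown()`, not SDK functions:

```cpp
// Assumed app-level wrappers around chip::Server::GetInstance().Init(...) and
// chip::Server::GetInstance().Shutdown(); implementations elided.
void AppInitMatterServer();
void AppShutdownMatterServer();

// Repeatedly cycle the server to catch crashes on re-initialisation
// (previously, VerifyOrDie in cluster init callbacks fired on the second cycle).
void StressInitShutdownCycles(int cycles)
{
    for (int i = 0; i < cycles; ++i)
    {
        AppInitMatterServer();
        // ...optionally drive reads/subscriptions/commands here...
        AppShutdownMatterServer();
    }
}
```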